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Abstract 

RNA-sequencing (RNA-Seq) has become the technology of choice for 
whole-transcriptome profiling. However, processing the millions of se- 
quence reads generated requires considerable bioinformatics skills and 
computational resources. At each step of the processing pipeline many 
tools are available, each with specific advantages and disadvantages. While 
using a specific combination of tools might be desirable, integrating the 
different tools can be time consuming, often due to specificities in the 
formats of input/output files required by the different programs. 

Here we present iRAP, an integrated RNA-seq analysis pipeline that 
allows the user to select and apply their preferred combination of existing 
tools for mapping reads, quantifying expression, testing for differential 
expression. iRAP also includes multiple tools for gene set enrichment 
analysis and generates web browsable reports of the results obtained in 
the different stages of the pipeline. Depending upon the application, iRAP 
can be used to quantify expression at the gene, exon or transcript level. 
iRAP is aimed at a broad group of users with basic bioinformatics training 
and requires little experience with the command line. Despite this, it also 
provides more advanced users with the ability to customise the options 
used by their chosen tools. 

iRAP is available under General Public License 3 (GPLv3) and al- 
though it should be portable to any POSIX-compliant operating system, 
several third party programs only run on Linux. iRAP can be obtained 
from http : / / code . google . com/p/ irap. 

1 Introduction 

A typical RNA-seq analysis pipeline involves several stages (Fig. 1), starting 
with processing of the raw reads and moving ultimately to identification of 

*to whom correspondence should be addressed 
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Figure 1: iRAP: Main steps in the analysis of RNA-seq data. 



differentially expressed genes. At each step of the pipeline the user is faced with 
choosing one of many existing tools. However, efficiently integrating a chosen 
set of tools is typically not straightforward, due primarily to differences in the 
input file formats required by each tool. In an attempt to overcome this problem, 
several pipelines for analyzing RNA-seq data have been developed (see Table 1 
in Appendix for a summary). Grape [10] exploits a single mapper with different 
mapping strategies before using FluxCapacitor for quantification. Array express 
HTS [6], an R based pipeline, performs the mapping using BWA [13], Bowtie [12] 
or Tophat [22] before using Cufflinks [23] or MMSeq [25] to quantify expression 
levels. Finally, Galaxy [5] is a general purpose web-based platform that allows 
users to analyse different types of HTS data - for RNA-seq, Galaxy supports 
the Tuxedo pipeline (Tophat, Cufflinks and Cuffdiff). 

Whilst useful, the pipelines mentioned above are all restricted to a small 
number of options at each stage, meaning that the user is unable to easily 
combine different sets of tools, which may be critical in certain settings. Here we 
present iRAP (an integrated RNA-seq Analysis Pipeline) that allows the user to 
choose the tool to be used on each stage of the analysis from amongst the many 
supported tools for mapping, quantification, differential expression analysis, and 
gene set enrichment analysis. iRAP performs the analysis from the raw reads 
up to the identification of differentially expressed genes or enriched set of genes 
with a single command. iRAP can also generate web reports to facilitate the 
inspection of the results produced at different stages of the analysis. 

2 About iRAP 

iRAP requires as input the FASTQ or BAM files containing the raw sequence 
reads, a reference genome (in FASTA format), gene model annotation (in GTF 
format) and a configuration file. 

In the first step of the pipeline, the quality of the data is assessed and, 
optionally, the reads can be trimmed and those with "bad" quality filtered out. 

The second step of the pipeline aligns the reads to the reference genome 
or transcriptome using one of the many available high-throughput sequencing 
mapping tools [4]. More than ten mappers, both splice aware and splice unaware 
are currently integrated in iRAP: Tophatl [22], Tophat2 [9], Osa [7], Star [3], 
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GSNAP [28], Bowtiel [12], Bowtie2 [11], Smalt \ BWAl [13], BWA2 [14], 
GEM [17] and SOAPsplice [8]. The SAM/BAM files produced by the map- 
pers are post-processed to ensure interoperability. This is necessary since the 
SAM/BAM files generated by some programs do not include all BAM fields 
required by downstream quantification tools (see appendix for more details) . 

The alignments generated are subsequently passed to one of four quantifica- 
tion tools to obtain a measure of expression over a biologically meaningful unit, 
such as exons, isoforms, or genes using Cufflinks [23, 24], FluxCapacitor [18], 
NURD [16], or HTSeq [2]. 

To identify differentially expressed genes, iRAP allows the user to select from 
one of the following tools: Cuffdiffl [23], Cuffdiff2 [21], DESeq [1] or EdgeR [19]. 
Gene set enrichment (GSE) analysis can be performed using one of the many 
gene set analysis methods available in the Piano R package [26] . Finally, iRAP 
allows the user to quickly explore the results of the analysis through web brows- 
able reports for each different stages of the analysis (QC, mapping, quantifi- 
cation, DE and GSE). Further details about the pipeline are provided in the 
Appendix. 

3 Discussion 

iRAP has a number of practical advantages. First it allows the practitioner to 
customise the analysis pipeline to the needs of a given project (e.g., question 
addressed, characteristics of the data, collaborators requests, etc) and change 
it easily as required. If there is uncertainty about the optimal pipeline setup, 
iRAP allows different options to be easily compared. 

Additionally, iRAP is user friendly while maintaining a high level of flexibil- 
ity. For instance, invoking iRAP clS irap conf =myexp . conf mapper=star quant_method=htseq 
de_method=deseq would use Star as the mapper, HTSeq as the quantification tool 
and DESeq for differential expression. Information about the data is passed 
to iRAP in a configuration file (myexp.conf in the above example). Flexibility 
is provided by letting the user: i) select the tool to use on each stage of the 
pipeline; ii) stop and re-start the analysis pipeline at any stage; and iii) pass 
specific parameters to the tools used. Finally, iRAP can exploit multi-processor 
computers by defining the number of cores that can be used; the whole pipeline 
can also be executed in parallel on LINUX clusters that have a Load Sharing 
Facility (LSF). In the future we plan to integrate more methods in the pipeline 
to offer more alternatives and to address other biological questions (e.g., fusion 
detection, exon or isoform-level differential expression). 
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A iRAP details 

iRAP requires as input the FASTQ files generated from an RNA-seq experiment, 
a reference genome (in FASTA format), gene model annotation (in GTF format) 
and an iRAP configuration file. The configuration file can contain options to 
run iRAP as well as information about the FASTQ files (read length, quality 
encoding, etc), experimental design (for differential expression analysis) and the 
location of the data (FASTQ files, genome and gene model). 

The pipeline is controlled by a main GNU Make file, which supports a variety 
of programs for sequence alignment, quantification, and differential expression. 
The software also includes many utilities, custom scripts and programs that are 
used by the main GNU Make file to pre-process and post-process the many files 
generated when the pipeline is used. 

iRAP avoids processing the same data set multiple times whenever possible. 
To do this, every time that iRAP is applied to the same input files, the software 
checks whether the files produced at each stage of the pipeline already exist 
and, if they do and are consistent, the existing files will be used instead of being 
recomputed. This is particularly useful since iRAP can be run up a particular 
stage (e.g., the end of the mapping) and stopped before re-starting the analysis. 
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A.l QC 

Before aligning the reads, iRAP can optionally check the quality of the data by 
invoking FastQC 2 . If the user so wishes, reads can be trimmed and those with 
bad "quality" excluded from the analysis. The filtering is performed in four 
steps: 

a) Base quality: the quality of the reads is assessed using the FASTX 3 
toolkit. The reads are processed as follows: i) reads with an overall per- 
centage (defined by the user) of the bases below some quality (default is 10) 
can be excluded; ii) bases at the ends of a read with a base quality below 
a user defined threshold can be trimmed; if the a read's length decreases 
by more than 15%, it is discarded; iii) reads containing artifacts can be 
removed, where artifacts are defined using the f astx_artif acts_f ilter. 

b) Contamination: reads that likely originate from organisms other than the 
one under study can be discarded. This is done by quickly aligning reads 
to the genomes of organisms that might be a source of contamination and 
those that map with a high degree of fidelity are discarded. 

c) Uncalled bases: reads with uncalled bases (bases where the sequencer was 
unable to determine the base and assigned it an 'N') are excluded. 

d) Unpaired reads: when reads are generated using a paired-end protocol, 
one of the pair can be lost in an earlier analysis step. These reads are 
placed into a singletons file, since most downstream analysis tools require 
paired end-reads to be separated into two FASTQ files, where the pairs 
are listed in the same order. Unpaired reads are assigned to a third file. 

Lastly, the filtered reads are processed using FastQC to generate an HTML 
report of the quality of the data. 

A. 2 Mapping 

Aligning the (quality filtered) reads to the reference genome is a key step in 
the analysis of RNA-seq data. iRAP allows the user to select from one of the 
following mappers: Tophat 1 [22], Tophat 2 [9], Osa [7], Star [3], GSNAP [28], 
Bowtie2 [11], Bowtiel [12], Smalt 4 , BWA1 [13], BWA2 [14], GEM [17] and 
SOAPsplice [8]. Other mappers can also be included if they output aligned 
reads in the SAM or BAM format. Most HTS mappers require the reference 
genome to be indexed [4] . This step is automatically performed by iRAP for the 
selected mapper. Moreover, some HTS mappers also require a file with splice 
junctions to be inputted. Again, iRAP automatically generates this file using 
the reference and annotation files provided as input. 

2 FastQC: http : //www .bioinf ormatics . babraham. ac .uk/projects/f astqc 
3 FASTX: http : //hannonlab . cshl . edu/f astx_toolkit/ 
4 Smalt: http : //www . Sanger . ac.uk/resources/software/smalt/ 
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The SAM/BAM files produced by the mappers are post-processed in two 
steps. First, to ensure interoperability, the SAM/BAM files are processed to 
include, if necessary, fields that the mappers failed to generate, but that may 
be required by downstream quantification tools (see Table 2 and Table 3). 

For each input library, the output of the mapping step is three files: a) a 
BAM file with the alignments sorted by chromosomal coordinates; b) a BAM 
file with the alignments sorted by read name; c) a BAM index for the file sorted 
by chromosomal coordinates. These files are generated using SamTools [15]. 
Obviously, generating two BAM files for the same data set is wasteful but this 
is necessary since some downstream tools require alignments to be sorted in 
different ways. 

A. 3 Quantification 

After mapping, the next task is to summarise and aggregate reads over a biolog- 
ically meaningful unit, such as exons, isoforms, or genes. iRAP supports several 
quantification tools: HTSeq 5 , Cufflinks 1 [23], Cufflinks 2 [24], NURD [16], and 
Flux-capacitor [18]. Subsequently, iRAP processes the output files to generate 
a matrix with the number of reads per biological unit and per BAM files. De- 
pending upon the quantification tool selected, iRAP produces a TSV file with 
the read count matrix per gene, exon or isoform. 

A. 4 Differential Expression 

iRAP allows the user to identify differentially expressed genes using one of 
the following tools: Cuffdiff 1 [23], Cuffdiff 2 [21], DESeq [1] or EdgeR [19]. 
Current strategies for integrated DE analysis arc limited to simple experiment 
designs, such as pairwise comparisons. For more complex designs, such as paired 
samples or time course experiments, the user needs to get the data and apply 
an appropriate alternative tool. 

A. 5 Gene set enrichment analysis 

Gene set enrichment (GSE) analysis can be performed using the (adjusted) p- 
values and fold-change computed for each gene during the differential expression 
analysis. iRAP performs GSE analysis using the Piano R package [26] which 
provides support for several statistical methods (e.g, fisher, stouffer, . . . ). The 
output of a GSE analysis will be a TSV file with gene set statistics and p- values. 

A. 6 Web reports 

iRAP allows the user to quickly explore the results of the analysis through 
web browsable reports for each different stages of the analysis (QC, mapping, 
quantification, DE and GSE). For each stage of the analysis one or several web 
pages are generated with tables and plots. 

5 HTSeq: http : //www-huber . embl.de/users/anders/HTSeq/ 
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B Installation and usage 

iRAP integrates many third party tools, all of which need to be installed. To 
facilitate this task, iRAP includes an install script to compile and install all 
third-party methods and tools in a user defined folder. A Virtual Machine 
is also provided with iRAP and all dependencies pre-inst ailed. Although the 
pipeline should be portable to any POSIX-compliant operating system, several 
third party methods only run on Linux. As a result, iRAP can only be installed 
on a Linux operating system with development libraries installed — these will be 
needed to compile and install Perl, R, Ruby and other utilities in a folder defined 
by the user. No administration access to the operating system is required. 

As mentioned above, iRAP requires as input FASTQ files containing the 
reads, a reference genome (in FASTA format), gene model annotation (in GTF 
format) and a configuration file. Then, assuming that the configuration file is 
called myexp . conf , the pipeline can be started by invoking iRAP: irap conf =myexp . conf . 
The specific combination of tools required can be easily specified from the com- 
mand line. For instance, the Tuxedo pipeline can be defined as irap conf =myexp . conf 
mapper=topnati quant_method=cuf f unksi de_method=cuf f diff i. Alternatively, the combina- 
tion of Star, HTSeq and DESeq can be specified as irap conf =myexp . conf mapper=star 
quant _method=htseq de_method=deseq. 

It is possible to run iRAP only up to some stage of the pipeline. For instance, 
irap conf =myexp . conf mapping will run the pipeline up to the mapping stage. 

Processing RNA-sequencing data is often computationally demanding both 
in terms of memory requirements and in CPU time. To reduce the execution 
time iRAP can exploit the multiple processors and cores available in a modern 
computer as well as utilising a Linux cluster. Furthermore, iRAP exploits the 
Platform LSF job scheduler by splitting the analysis into multiple jobs with the 
aim of reducing the time to produce the results. From the user perspective, this 
can be achieved without changing the parameters, input files or configuration 
file. 

Finally, it is important to keep track of the software used and the respec- 
tive versions. iRAP provides this information alongside a bibliographic citation 
of the tools used (if available) via a simple command (e.g. ,irap conf =myexp . conf 
show.citations). Furthermore, iRAP provides all commands executed along with 
the input and output files providing the user with a log of all commands exe- 
cuted. 
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Pipeline 


QC/Filtcr 


Mappers 


Quantification 


DE 


GSE 


Web Reports 


Galaxy (RNA-seq) [5] 


Yes/Yes 


1* 


1 


1 


No 


No 


ArrayExpress HTS [6] 


Yes/No 


3 


2 


No 


No 


Yes 


Grape [10] 


Yes/No 


1 


1 


No 


No 


Yes 


PRADA [20] 


No/No 


1 


1 


No 


No 


No 


RSeqFlow [27] 


No/No 


2 


1 


2 


No 


No 


iRAP 


Yes/ Yes 


11 


5 


4 


Yes 


Yes 



Table 1: iRAP comparison to other RNA-seq analysis pipelines. QC/Filtcr: 
indicates whether pipelines (can) apply quality control / allow reads to be fil- 
tered. Mappers: the number of mappers supported; Quantification: the number 
of quantification methods supported. DE: the number of differential expression 
analysis methods supported. GSE: whether the pipeline integrates gene set en- 
richment analysis. Web reports: indicates if pipelines produce a web report for 
each stage of the analysis. *Mappers part of the RNA-seq analysis in Galaxy. 



Method 


NH 


XSA 


Splicing (Cigar) 


HTSeq 


Yes 


No 


No 


Cufflinks 1 and 2 


Yes 


Yes 


Yes 


Flux-Capacitor 


Yes 


Yes 


Yes 


NURD 


No 


No 


No 



Table 2: SAM/BAM fields used by different quantification methods. The NH 
field contains information about the number of alignments associated with a 
read; the XSA field contains the splice strand orientation; and Splicing indicates 
whether splicing is reported in the cigar string. 



Mappers 


NH 


XS:A 


Splicing (cigar) 


Tophat 1 


Yes 


Yes 


Yes 


Tophat 2 


Yes 


Yes 


Yes 


Bowtie 1 


No 


No 


No 


Bowtie 2 


No 


No 


No 


Smalt 


No 


No 


No 


Gsnap 


Yes 


Yes 


Yes 


Soapsplice 


No 


No 


Yes 


BWA 1 


No 


No 


No 


BWA 2 


No 


No 


No 


Gem 


Yes 


No 


No 


OSA 


Yes 


Yes 


Yes 


Star 


Yes 


Yes 


Yes 



Table 3: SAM/BAM fields generated by the different mappers (Yes - generated, 
No - not generated) . 
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